Introduction to Exploratory Data Analysis and Applied Statistical Techniques

Module 01

Ray J. Hoobler

Data Visualization

Visualizations Are Not New

1977

“The simple graph has brought more information to the data analyst’s mind than any other device.”

—John Tukey

Exploratory Data Analysis by John Tukey (Tukey 1977) is now considered a classic in the field of data analysis and statistics.

Four chapters are devoted to Graphic Presentation in my copy of Applied General Statistics (Croxton and Cowden 1946). (The book was first published in 1939.)

R for Data Science (2e)

2023

R for Data Science is an introduction to data manipulation and visualization. The authors are proponents of the tidyverse and ggplot2. The tidyverse is a collection of R packages designed for data science; it is an alternative to base R, the functions that ship with R itself.

The tidyverse provides an integrated framework that allows beginners to quickly get up to speed with data manipulation.

ggplot2 is a plotting system for R, based on the grammar of graphics. Once you become familiar with ggplot2, you will see its presence in many publications. A Layered Grammar of Graphics (Wickham 2010) provides the philosophical framework for ggplot2.

Prerequisites

Before you begin any readings, you should have R and RStudio installed on your computer.

Follow the instructions on the Posit.co website for installing the RStudio IDE (integrated development environment).

  1. Install R from the Rstudio.com mirror of the CRAN website.
  2. Install RStudio from Posit.co.
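
Beyond R and RStudio, the code in this module loads three add-on packages: tidyverse, palmerpenguins, and ggthemes. The following is a minimal sketch, assuming a default R installation, for checking which of them still need to be installed. The variable names are illustrative only.

```r
# Packages loaded with library() later in this module
needed <- c("tidyverse", "palmerpenguins", "ggthemes")

# requireNamespace() returns TRUE when a package is already installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Printed rather than run, so nothing is installed without your say-so
  message(
    "Run: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
} else {
  message("All packages for this module are installed.")
}
```

Pasting the printed install.packages() call into the console installs whatever is missing.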

Getting Started

Once you have R and RStudio installed, start RStudio and type library(tidyverse) in the console.

Code
library(tidyverse)


The first time you load the package in a session, you’ll see a startup message listing the attached core tidyverse packages and any function-name conflicts.

The Palmer Penguins Dataset

The Palmer Penguins dataset is a popular dataset for learning data visualization. It is bundled with the palmerpenguins package, which was created by Allison Horst, Alison Hill, and Kristen Gorman and is available on GitHub.

Code
library(palmerpenguins)

Data Frames

Data frames will be the default data structure we use in this course. Data frames should look familiar to anyone who has used spreadsheets.

Code
penguins

Variables are in columns and observations are in rows.
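
As a quick check of the rows-and-columns structure, the sketch below inspects the data frame with base R functions. It assumes only that the palmerpenguins package is installed.

```r
library(palmerpenguins)

# Observations are rows, variables are columns
dim(penguins)    # 344 rows (penguins) and 8 columns (variables)
names(penguins)  # one name per variable

# Compact overview: each variable's type and its first few values
str(penguins)
```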

“Ultimate goal” for Chapter 1 in R for Data Science

Code
library(ggthemes)

ggplot(
  data = penguins, 
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) + 
  geom_point(mapping = aes(color = species)) +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  scale_color_colorblind()

Visualizations with ggplot: Step 1

Code
ggplot(data = penguins)

Visualizations with ggplot: Step 2

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

Visualizations with ggplot: Step 3

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) + 
  geom_point()

Warning

Warning: Removed 2 rows containing missing values or values outside the scale range (geom_point()).
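
The warning can be traced back to the data. The sketch below, assuming the palmerpenguins package is installed, counts the rows where either plotted variable is missing. The name missing_xy is illustrative.

```r
library(palmerpenguins)

# A point cannot be drawn when either plotted variable is missing
missing_xy <- is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)

sum(missing_xy)         # number of rows that geom_point() drops
penguins[missing_xy, ]  # inspect the affected rows
```

Passing na.rm = TRUE to geom_point() drops the same rows without printing the warning.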

Visualizations with ggplot: Step 4

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) + 
  geom_point()

Visualizations with ggplot: Step 5

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) + 
  geom_point() +
  geom_smooth(method = "lm")

Important

When aesthetic mappings are defined in the ggplot() function, they are inherited by all layers.

The aesthetic “color” is being applied to both the geom_point() and geom_smooth() layers.

Visualizations with ggplot: Step 6

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
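
One way to see the effect of inheritance is to count the groups that geom_smooth() fits in each version of the plot. The sketch below is not part of the original steps. It uses layer_data() from ggplot2 to pull out the data behind the second layer, and p_global and p_local are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# color mapped in ggplot(): inherited by both layers
p_global <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# color mapped in geom_point(): used by the point layer only
p_local <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# layer_data() returns the data a layer draws; layer 2 is geom_smooth()
groups_global <- length(unique(layer_data(p_global, 2)$group))
groups_local <- length(unique(layer_data(p_local, 2)$group))

groups_global  # one fitted line per species
groups_local   # a single fitted line for all penguins
```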

Visualizations with ggplot: Step 7

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")

Visualizations with ggplot: Step 8

```{r}
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind()
```

Module 1 Assignment 1

Create a new Quarto HTML document and answer questions 1 through 10 in the R for Data Science section 1.2.5 Exercises.

Exploratory Data Analysis

NIST/SEMATECH e-Handbook of Statistical Methods

The NIST/SEMATECH e-Handbook of Statistical Methods is a collaborative project involving the National Institute of Standards and Technology (NIST) and SEMATECH.

NIST is a non-regulatory federal agency within the U.S. Department of Commerce. The main role of NIST is to promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology.

SEMATECH was a research consortium composed of semiconductor manufacturers and suppliers.

What is EDA According to NIST/SEMATECH?

1.1.1

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

  • maximize insight into a data set;
  • uncover underlying structure;
  • extract important variables;
  • detect outliers and anomalies;
  • test underlying assumptions;
  • develop parsimonious models; and
  • determine optimal factor settings.

EDA Techniques Encouraged by NIST/SEMATECH

1.1.1

The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

  • Plotting the raw data. (Scatter plots, histograms, probability plots, etc.)
  • Plotting simple statistics. (Mean plots, standard deviation plots, box plots, etc.)
  • Positioning such plots to maximize our natural pattern-recognition abilities, such as using multiple plots per page. (Subplots, faceting, etc.)
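
As an illustration of the second technique, plotting simple statistics, the sketch below computes the mean and standard deviation of body mass per species and draws a basic mean plot. It assumes the Palmer Penguins data used elsewhere in this module. stat_summary() is one of several ways to do this in ggplot2.

```r
library(ggplot2)
library(palmerpenguins)

# Simple statistics per species, computed with base R
mean_mass <- tapply(penguins$body_mass_g, penguins$species, mean, na.rm = TRUE)
sd_mass <- tapply(penguins$body_mass_g, penguins$species, sd, na.rm = TRUE)
round(rbind(mean = mean_mass, sd = sd_mass))

# Mean plot: stat_summary() computes the mean of y within each x group
mean_plot <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  stat_summary(fun = mean, geom = "point", size = 3, na.rm = TRUE)
mean_plot
```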

EDA Goals According to NIST/SEMATECH

1.1.4

The primary goal of EDA is to maximize the analyst’s insight into a data set and into the underlying structure of a data set, while providing all of the specific items that an analyst would want to extract from a data set, such as:

  • a good-fitting, parsimonious model;
  • a list of outliers;
  • a sense of robustness of conclusions;
  • estimates for model parameters;
  • uncertainties for those estimates;
  • a ranked list of important factors;
  • conclusions as to whether individual factors are significant;
  • optimal settings.

EDA Assumptions According to NIST/SEMATECH

1.2.1

Data from a process or experiment “behaves like”

  • a random drawing;
  • from a fixed distribution;
  • with the distribution having a fixed location; and
  • with the distribution having a fixed variation.
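
These assumptions can be checked graphically. The sketch below uses simulated data, not data from the handbook, to draw two of the plots commonly used for this: a run sequence plot and a lag plot. The data frame sim and its columns are hypothetical.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical measurements: 200 random draws from one fixed distribution
sim <- data.frame(run = 1:200, y = rnorm(200, mean = 10, sd = 2))

# Run sequence plot: a drift in level or spread would suggest that the
# location or the variation is not fixed
run_plot <- ggplot(sim, aes(x = run, y = y)) +
  geom_line()

# Lag plot: each value against the previous one; visible structure
# would suggest the draws are not random
lag_df <- data.frame(y_prev = head(sim$y, -1), y_next = tail(sim$y, -1))
lag_plot <- ggplot(lag_df, aes(x = y_prev, y = y_next)) +
  geom_point()

run_plot
lag_plot
```

For data that meet the assumptions, the run sequence plot should look like a flat band and the lag plot like a structureless cloud.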

Visualizing Distributions

Distribution of a Categorical Variable (1/2)

What are categorical variables in the Palmer Penguins dataset?
What do we calculate with distributions?

Code
penguins

Distribution of a Categorical Variable (2/2)

Code
ggplot(penguins, aes(x = species)) +
  geom_bar()

Code
ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()
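
For a categorical variable, the quantity behind the bars is the count of observations in each category. The sketch below, assuming the palmerpenguins and forcats packages are installed, computes those counts and shows how fct_infreq() changes the order of the levels.

```r
library(palmerpenguins)
library(forcats)

# Frequency of each category: these are the bar heights
table(penguins$species)  # Adelie 152, Chinstrap 68, Gentoo 124

# fct_infreq() reorders the factor levels from most to least frequent
levels(penguins$species)              # default order: alphabetical
levels(fct_infreq(penguins$species))  # order used in the second bar chart
```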

Distribution of a Numerical Variable (1/2)

Code
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 100, na.rm = TRUE)

Code
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(na.rm = TRUE)

Distribution of a Numerical Variable (2/2)

Code
ggplot(penguins, aes(x = body_mass_g, y = after_stat(density))) +
  geom_histogram(binwidth = 100, na.rm = TRUE, fill = "grey", color = "black") +
  geom_density(kernel = "gaussian", bw = 200, na.rm = TRUE, color = "red")
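
Mapping y to after_stat(density) rescales the histogram so that the total area of its bars is 1, which puts it on the same scale as the density curve. The sketch below checks this numerically with layer_data() from ggplot2. The names hist_plot, bins, and total_area are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

hist_plot <- ggplot(penguins, aes(x = body_mass_g, y = after_stat(density))) +
  geom_histogram(binwidth = 100, na.rm = TRUE)

# layer_data() exposes the values computed by the histogram stat
bins <- layer_data(hist_plot, 1)

# On the density scale, bar height times bar width sums to 1
total_area <- sum(bins$density * (bins$xmax - bins$xmin))
total_area
```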

Visualizing Relationships

A Numerical and a Categorical Variable

What are the key components of a boxplot?

Code
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(na.rm = TRUE)
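
The components of a boxplot can be computed by hand. The sketch below does so for the Adelie body masses, using quartiles from quantile() and the common rule of 1.5 times the interquartile range for the whiskers. The variable names are illustrative.

```r
library(palmerpenguins)

# Body mass for one species, with missing values removed
adelie_mass <- penguins$body_mass_g[penguins$species == "Adelie"]
adelie_mass <- adelie_mass[!is.na(adelie_mass)]

# Box: first quartile, median, third quartile
q <- quantile(adelie_mass, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(adelie_mass[adelie_mass >= q[["25%"]] - 1.5 * iqr])
upper_whisker <- max(adelie_mass[adelie_mass <= q[["75%"]] + 1.5 * iqr])

# Observations beyond the whiskers are drawn individually as outliers
outliers <- adelie_mass[adelie_mass < lower_whisker | adelie_mass > upper_whisker]

c(lower_whisker = lower_whisker, q, upper_whisker = upper_whisker)
outliers
```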

What relationships are visible in a density plot?

Code
ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density(linewidth = 1, na.rm = TRUE)

Two Categorical Variables

Code
ggplot(penguins, 
       aes(x = island, fill = species)) +
  geom_bar()

Code
ggplot(penguins, 
       aes(x = island, fill = species)) +
  geom_bar(position = "fill")

Code
ggplot(penguins, 
       aes(x = island, fill = species)) +
  geom_bar(position = position_dodge(preserve = "single"))
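
All three bar charts draw on the same cross-tabulation of island and species. The sketch below, assuming the palmerpenguins package is installed, computes it with base R. The counts correspond to the stacked and dodged bars, and the row-wise proportions to position = "fill".

```r
library(palmerpenguins)

# Counts of each species on each island: the stacked and dodged bars
counts <- table(island = penguins$island, species = penguins$species)
counts

# Within-island proportions: what position = "fill" displays
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Several combinations have a count of zero (only Adelie penguins were recorded on Torgersen, for example), which is why the dodged chart uses preserve = "single" to keep the bar widths equal.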

Two Numerical Variables

Code
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

Three or More Variables

Mapping Variables to Aesthetics

Code
ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g, 
                     color = species, shape = island)) +
  geom_point()

Faceting

Code
ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g, 
                     color = species, shape = species)) +
  geom_point() +
  facet_wrap(~island)

Module 1 Assignment 2

Create a new Quarto HTML document and answer questions 1, 2, and 6 in the R for Data Science section 1.5.5 Exercises.

End of Module 1

References

Croxton, Frederick E., and Dudley J. Cowden. 1946. Applied General Statistics. New York: Prentice-Hall.
Tukey, John Wilder. 1977. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.